Druid SIP
Druid SIP is the telephony voice gateway for Druid Voice. It connects AI Agents to SIP-based VoIP infrastructure, so callers can interact with AI Agents through existing telephony environments without changes to the current voice architecture.
Acting as the core bridge between telephony systems and the Druid AI platform, the gateway manages:
- Call Control. Streamlined call establishment and termination.
- Audio Processing. Real-time audio streaming and voice interaction management.
- Security. Access and session controls for voice traffic, including IP allow-listing, SIP token authentication, and mutual TLS.
Supported Telephony Environments
- Contact centers and PBXs
- SIP trunks and Session Border Controllers (SBCs)
Druid SIP allows organizations to embed conversational AI directly into their voice ecosystems, maximizing the value of current telephony investments.
Core Capabilities
Druid SIP provides robust session, media, and security management:
Session & Media Management
- SIP Signaling & Session Management. Handles call setup, teardown, and routing.
- RTP Media Handling. Ensures stable bidirectional Real-time Transport Protocol (RTP) audio stream delivery.
- DTMF Processing. Supports dual-tone multi-frequency (DTMF) key inputs.
- Voice Activity Detection (VAD) & Barge-in. Enables natural, fluid conversations by detecting when a user speaks or interrupts.
- Call Recording. Captures and logs voice sessions for compliance, quality assurance, and training purposes.
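To make the RTP and DTMF items above more concrete, the following is a minimal sketch of how a receiver could decode the fixed RTP header (RFC 3550) and a DTMF telephone-event payload (RFC 4733). It is not Druid SIP's implementation. The names `RtpHeader`, `parse_rtp_header`, and `parse_dtmf_event` are hypothetical, and the payload type used for telephone events is an assumption: in practice it is negotiated per call in SDP.

```python
import struct
from dataclasses import dataclass

# Event codes 0-15 defined for DTMF in RFC 4733.
DTMF_EVENTS = "0123456789*#ABCD"


@dataclass
class RtpHeader:
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    header_len: int  # offset at which the payload starts


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the RTP header, skipping CSRC entries and any header extension."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    header_len = 12 + 4 * (b0 & 0x0F)  # 4 bytes per CSRC identifier
    if b0 & 0x10:  # X bit: a header extension follows the CSRC list
        if len(packet) < header_len + 4:
            raise ValueError("truncated RTP header extension")
        (ext_words,) = struct.unpack("!H", packet[header_len + 2:header_len + 4])
        header_len += 4 + 4 * ext_words
    return RtpHeader(
        version=version,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        header_len=header_len,
    )


def parse_dtmf_event(payload: bytes) -> tuple:
    """Return (key, end_of_event, duration) from a telephone-event payload."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        raise ValueError(f"event code {event} is not a DTMF key")
    return DTMF_EVENTS[event], bool(flags & 0x80), duration
```

A caller would typically check `payload_type` against the value negotiated for telephone events, then pass `packet[header.header_len:]` to `parse_dtmf_event` and act on the key once the end-of-event flag is set.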
Security & Authentication
Druid SIP enforces zero-trust access boundaries using:
- IP Allow-list Security. Restricts access to trusted networks.
- SIP Token Authentication. Validates endpoint identity securely.
- Mutual TLS (mTLS) Session Authentication. Enforces two-way cryptographic verification for all traffic.
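As an illustration of the IP allow-list idea above, the sketch below checks a source address against a set of trusted CIDR ranges using Python's standard `ipaddress` module. It shows the general technique only, not how Druid SIP is configured or implemented. The function names are hypothetical, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress


def build_allow_list(cidrs):
    """Parse CIDR strings (IPv4 or IPv6) into network objects."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def is_allowed(source_ip, allow_list):
    """Return True if source_ip falls inside any allow-listed network."""
    address = ipaddress.ip_address(source_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to check with a single loop.
    return any(address in network for network in allow_list)


# Hypothetical allow list: one SBC subnet and one single trusted host.
allow_list = build_allow_list(["203.0.113.0/24", "198.51.100.7/32"])
```

With this list, `is_allowed("203.0.113.45", allow_list)` is true, while a request from `192.0.2.1` would be rejected before any SIP processing takes place.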
Druid SIP Architecture
The Druid SIP gateway is deployed in the Druid Cloud and exposed through a public FQDN, which telephony platforms use as their primary SIP destination.
The following diagram illustrates how Druid SIP connects SIP-based voice infrastructure to the Druid AI Platform.
Telephony Infrastructure / Caller
Voice interactions originate from callers and reach Druid SIP through telephony systems such as contact centers, PBXs, IVRs, SIP trunks, or SBCs. These systems communicate with Druid SIP using standard SIP signaling and RTP audio streams.
Druid SIP (VoIP Gateway)
Druid SIP acts as the entry point for all voice interactions.
Druid AI Platform
After receiving the audio stream, Druid SIP routes the conversation to the Druid AI Platform, where:
- The Language Understanding Engine extracts intent and context.
- The Flow Engine executes business logic and conversation flows.
- The AI Agent generates responses.
- Conversation state is maintained throughout the interaction.
For all Druid SIP conversations, the platform automatically sets [[ChatUser]].ChannelId = "druid-sip". This allows AI Agents to apply channel-specific logic when required.
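The sketch below illustrates what channel-specific logic keyed on this value might look like. It is an assumption-laden illustration in Python: Druid flows are not authored this way, and the `chat_user` dictionary and the `is_voice_channel` and `build_reply` helpers are hypothetical. Only the `ChannelId` property and the `"druid-sip"` value come from this document.

```python
DRUID_SIP_CHANNEL_ID = "druid-sip"


def is_voice_channel(chat_user):
    """True when the conversation arrived through Druid SIP."""
    return chat_user.get("ChannelId") == DRUID_SIP_CHANNEL_ID


def build_reply(chat_user, text, url=None):
    """Append a link only on channels where the user can actually open it."""
    if is_voice_channel(chat_user):
        # A URL read aloud is rarely useful, so voice replies stay text-only.
        return text
    return f"{text} {url}" if url else text
```

The same branch could instead shorten prompts, switch to DTMF-friendly menus, or skip rich cards when the channel is voice.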
Speech-to-Text Services
Incoming audio is sent to the configured Speech-to-Text provider, which converts spoken language into text before it is processed by the AI Agent. Supported providers are documented in the TTS and STT Vendors topic.
LLM Services (optional)
When generative AI capabilities are required, the AI Agent can call one or more configured Large Language Model (LLM) providers. LLM calls are optional and depend on the AI Agent setup. Supported providers are documented in the LLM Resource Management and Governance topic.
Text-to-Speech Services
After the response is generated, the text is sent to the configured Text-to-Speech provider, which converts the response into audio and returns it to Druid SIP for playback to the caller. Supported providers are documented in the TTS and STT Vendors topic.
End-to-End Call Flow
- The caller initiates a voice interaction through a SIP-enabled telephony platform.
- A SIP session is established with Druid SIP.
- RTP audio is streamed to Druid SIP.
- Druid SIP processes voice activity and DTMF events.
- Audio is sent to the configured Speech-to-Text provider.
- The transcribed text is processed by the Flow Engine.
- The AI Agent optionally invokes LLM services.
- A response is generated and sent to the configured Text-to-Speech provider.
- The synthesized audio is streamed back through Druid SIP.
- The caller hears the AI Agent response.
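The flow above can be summarized as a single conversational turn: audio in, text through the agent, audio out. The following sketch models that turn with pluggable callables. Every name in it (`handle_turn`, `stt`, `flow`, `llm`, `tts`) is hypothetical and stands in for a configured provider or platform component. Treating the LLM as a fallback when the flow produces no reply is one possible arrangement, assumed here only for illustration.

```python
FALLBACK_REPLY = "Sorry, I did not understand that."


def handle_turn(audio, stt, flow, tts, llm=None):
    """Process one caller utterance and return synthesized reply audio.

    stt:  callable(bytes) -> str           speech-to-text provider
    flow: callable(str) -> str or None     flow logic; None means no answer
    llm:  callable(str) -> str, optional   generative provider
    tts:  callable(str) -> bytes           text-to-speech provider
    """
    text = stt(audio)           # audio is transcribed
    reply = flow(text)          # transcribed text is processed by the flow
    if reply is None and llm is not None:
        reply = llm(text)       # the LLM is invoked only when configured
    if reply is None:
        reply = FALLBACK_REPLY
    return tts(reply)           # reply is synthesized for playback
```

In a real deployment each callable would wrap a network call to a provider, and the turn would repeat until the SIP session ends.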